Abstract

The way we model semantic similarity is closely tied to our understanding of linguistic representations. We present several models of semantic similarity, based on differing representational assumptions, and investigate their properties via comparison with human ratings of verb similarity. The results offer insight into the bases for human similarity judgments and provide a testbed for further investigation of the interactions among syntactic properties, semantic structure, and semantic content.

Introduction 

The way we model semantic similarity is closely tied to our understanding of how linguistic representations are acquired and used. Some models of similarity, such as Tversky’s (1977), assume an explicit set of features over which a similarity measure can be computed, and recent computational methods for measuring word similarity can be thought of as an update of this idea on a large scale, representing words in terms of distributional features acquired via analysis of text corpora (e.g., Brown, Della Pietra, deSouza, Lai, & Mercer, 1992; Schütze, 1993). Other methods, following in the semantic networks tradition of Quillian (1968), focus less on explicit features and more on relationships among lexical items within a conceptual taxonomy, sometimes going beyond taxonomic relationships to also take advantage of frequency information derived from corpora (e.g., Rada, Mili, Bicknell, & Blettner, 1989; Resnik, 1999).

Although some of these approaches are not explicitly designed as cognitive models, we have proposed that prediction of human similarity ratings can provide a useful point of comparison for computational measures of similarity, noting that such comparisons can be quite sensitive to the specific choice of test items (Resnik, 1999). To date, the only such comparisons we are aware of have used noun similarity.

In this paper, we consider the problem of measuring the semantic similarity of verbs. Verb similarity is in many respects a different problem from noun similarity, because verb representations are generally viewed as possessing properties that nouns do not, such as syntactic subcategorization restrictions, selectional preferences, and event structure, and there are dependencies among these properties.1 This means that particular care must be taken in selecting items, as discussed below, and it also means that the same computational measures may be capturing different properties for verbs than for nouns. For example, the IS-A relationship in WordNet’s verb taxonomy (Fellbaum, 1998), central in the computation of some measures, signifies generalization according to manner, as in devour IS-A eat; concomitantly, the verb taxonomy is considerably wider and shallower than WordNet’s noun taxonomy. Similarly, measures based on syntactic dependencies may be sensitive to syntactic adjuncts, such as locative and temporal modifiers, that occur predominantly with verbs rather than with nouns.

1 Admittedly, the relevant contrast may turn out not to be part-of-speech per se; one could argue that some nouns carry similar kinds of participant information, observing, for example, that x’s gift of y to z parallels x gave y to z. We are not attempting to address that issue here.

In what follows, we first discuss several different measures of word similarity and their properties. We then describe an experiment designed to obtain human similarity ratings for pairs of verbs, discuss the fit of the alternative measures to the human ratings, and suggest some implications of these results for future work.

Models of Verb Similarity

We consider three classes of similarity measure, corresponding to three kinds of lexical representation. In the first, verbs are associated with nodes in a semantic network. In the second, verbs are represented by distributional syntactic co-occurrence features obtained via analysis of a corpus. In the third, verbs are associated with lexical entries represented according to a theory of lexical conceptual structure. These classes of representation can be viewed as occupying three different points on the spectrum from non-syntactic to syntactically relevant facets of verb meaning.

Taxonomic Models

Taxonomic models of lexical and conceptual knowledge have a long history. In this work we use WordNet version 1.5, a large-scale taxonomic representation of concepts lexicalized in English. As a model of the lexicon, WordNet’s verb hierarchy is limited by design to paradigmatic relations, in explicit contrast to attempts to organize semantically coherent verb classes through shared syntactic behavior.

The simplest and most traditional measure of semantic similarity in a taxonomy counts the number of edges intervening between nodes (“edge counting”). A distance
in edges is converted to similarity by subtracting from the maximum possible distance in the taxonomy, giving the following measure of similarity between verbs $w_1$ and $w_2$:

$$\mathrm{wsim}_{\mathrm{edge}}(w_1, w_2) = (2 \times \mathrm{MAX}) - \left[ \min_{c_1, c_2} \mathrm{len}(c_1, c_2) \right] \qquad (1)$$

where $c_1$ ranges over $s(w_1)$, $c_2$ ranges over $s(w_2)$, MAX is the maximum depth of the taxonomy, and $\mathrm{len}(c_1, c_2)$ is the length of the shortest path from $c_1$ to $c_2$, with $s(w)$ denoting the set of concepts in the taxonomy that represent senses of word $w$. If all senses of $w_1$ and $w_2$ are in separate sub-taxonomies of the WordNet verb hierarchy, their edge-count similarity is defined to be zero.
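
As an illustration only, the edge-counting measure can be sketched in Python over a small hypothetical taxonomy. The toy IS-A links other than devour IS-A eat, the dictionary `PARENTS`, the constant `MAX_DEPTH`, and all function names are assumptions made for this sketch; they are not taken from WordNet or from the implementation used in the experiments.

```python
from collections import deque

# Hypothetical toy IS-A taxonomy (child -> list of parents); not WordNet data.
PARENTS = {
    "devour": ["eat"],
    "nibble": ["eat"],
    "eat": ["consume"],
    "drink": ["consume"],
    "consume": [],
    "roll": ["move"],   # a separate sub-taxonomy
    "move": [],
}
MAX_DEPTH = 2  # maximum depth of this toy taxonomy, counted in edges


def neighbors(node):
    """Concepts one IS-A edge away from `node`, in either direction."""
    children = [c for c, parents in PARENTS.items() if node in parents]
    return PARENTS[node] + children


def path_len(c1, c2):
    """Shortest path length in edges between two concepts, or None if unconnected."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def wsim_edge(senses1, senses2):
    """(2 x MAX) minus the shortest path over all sense pairs; 0 if no path exists."""
    lengths = [path_len(c1, c2) for c1 in senses1 for c2 in senses2]
    lengths = [n for n in lengths if n is not None]
    return 2 * MAX_DEPTH - min(lengths) if lengths else 0
```

Each word is passed as the list of concepts representing its senses, so a word with several senses is scored by its closest pair of senses.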

The simple edge-counting approach has well-known problems, and arguments have been made for the following measure of semantic similarity between concepts in a taxonomy based on shared information content (Resnik, 1999):

$$\mathrm{sim}_{\mathrm{info1}}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left[ -\log p(c) \right], \qquad (2)$$

where $S(c_1, c_2)$ is the set of concepts that subsume both $c_1$ and $c_2$, and $-\log p(c)$ quantifies the “information content” of node $c$. This yields a measure of verb similarity

$$\mathrm{wsim}_{\mathrm{info1}}(w_1, w_2) = \max_{c_1, c_2} \left[ \mathrm{sim}_{\mathrm{info1}}(c_1, c_2) \right], \qquad (3)$$

where $c_1$ ranges over $s(w_1)$ and $c_2$ ranges over $s(w_2)$, and $p(c)$ is estimated by observing frequencies in a corpus.2 Intuitively, the quantity defined in (3) measures the maximum overlap in information between the words being compared. When two words are not very similar, the information content of their most informative subsumer (the node $c$ maximizing $-\log p(c)$) is low: that subsumer resides high in the taxonomy and thus has high probability, implying low information content. In the most extreme case, the most informative subsumer is just the TOP node of the taxonomy, in which case the probability is 1 and the shared information content (and hence similarity) is 0. When two words are similar, that means there is a node lower in the taxonomy that subsumes them both; being lower in the taxonomy its probability is lower and therefore its information content is higher. Crucially, structural notions such as “lower” and “higher”, and the number of intervening arcs between nodes, play no actual role in this model of similarity. As a result, unlike edge counting, this measure does not fall prey to the rampant variation in density within any realistic conceptual taxonomy, where a single IS-A link could represent a tiny semantic distance (e.g. ballpoint_pen IS-A pen) or a very large semantic distance (e.g. toy IS-A artifact).3
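
The information-content measure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy taxonomy `PARENT`, the invented probabilities in `P`, and the function names are hypothetical, and a single-parent taxonomy with an explicit TOP node is assumed for brevity.

```python
import math

# Hypothetical single-parent taxonomy (child -> parent) with an explicit TOP node.
PARENT = {
    "devour": "eat", "nibble": "eat", "eat": "consume",
    "drink": "consume", "consume": "TOP", "roll": "TOP",
}
# Invented node probabilities; they never decrease on the way up to TOP.
P = {
    "devour": 0.01, "nibble": 0.02, "eat": 0.10, "drink": 0.15,
    "consume": 0.30, "roll": 0.20, "TOP": 1.0,
}


def subsumers(c):
    """The concept itself plus all of its ancestors up to TOP."""
    out = [c]
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out


def sim_info1(c1, c2):
    """Information content of the most informative concept subsuming both."""
    shared = set(subsumers(c1)) & set(subsumers(c2))
    return max(-math.log(P[c]) for c in shared)


def wsim_info1(senses1, senses2):
    """Maximum concept similarity over all pairs of word senses."""
    return max(sim_info1(c1, c2) for c1 in senses1 for c2 in senses2)
```

In this toy example devour and nibble share the subsumer eat, whose information content is higher than that of consume, while devour and roll share only TOP and so receive a similarity of zero. No path lengths enter the computation.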

2 For taxonomic measures described in this section, probabilities of nodes in WordNet 1.5 were estimated on the basis of word frequencies in the Brown Corpus (Francis & Kucera, 1982).

3 Examples are from WordNet 1.5, where artifact signifies a man-made object.

Lin (1998) argues for an alternative information-based measure of similarity that, when applied to a taxonomy, closely resembles the measure just described. It differs in normalizing the shared information content using the sum of the unshared information content of each item being compared:

$$\mathrm{sim}_{\mathrm{info2}}(c_1, c_2) = \frac{2 \times \log p\left(\bigcap_i C_i\right)}{\log p(c_1) + \log p(c_2)} \qquad (4)$$

where the $C_i$ are the “maximally specific superclasses” of both $c_1$ and $c_2$. As a result of this normalization, the measure possesses some desirable properties, such as a fixed range from 0 to 1. Word similarity $\mathrm{wsim}_{\mathrm{info2}}$ is defined analogously to Definition (3).
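
A minimal Python sketch of the normalized measure follows. It assumes a hypothetical single-parent toy taxonomy, in which the maximally specific superclass of two concepts is simply their shared subsumer with the lowest probability; `PARENT`, `P`, and the function names are invented for the illustration and are not Lin's implementation.

```python
import math

# Hypothetical single-parent taxonomy (child -> parent) and invented probabilities.
PARENT = {
    "devour": "eat", "nibble": "eat", "eat": "consume",
    "drink": "consume", "consume": "TOP", "roll": "TOP",
}
P = {
    "devour": 0.01, "nibble": 0.02, "eat": 0.10, "drink": 0.15,
    "consume": 0.30, "roll": 0.20, "TOP": 1.0,
}


def subsumers(c):
    """The concept itself plus all of its ancestors up to TOP."""
    out = [c]
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out


def sim_info2(c1, c2):
    """Shared information content, normalized by the information in c1 and c2."""
    shared = set(subsumers(c1)) & set(subsumers(c2))
    most_specific = min(shared, key=lambda c: P[c])
    denom = math.log(P[c1]) + math.log(P[c2])
    if denom == 0:  # both concepts are TOP; treated as identical in this sketch
        return 1.0
    return 2 * math.log(P[most_specific]) / denom


def wsim_info2(senses1, senses2):
    """Maximum concept similarity over all pairs of word senses."""
    return max(sim_info2(c1, c2) for c1 in senses1 for c2 in senses2)
```

A concept compared with itself scores 1, and two concepts whose only shared subsumer is TOP score 0, which illustrates the fixed range noted above.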


Distributional  Co-Occurrence  Model 

Information-based measures of similarity can be applied to representations other than taxonomic structures. Indeed, Lin demonstrates the generality of the idea by showing how such a measure can be used to measure not only taxonomic distance but also string similarity and the distance between feature sets à la Tversky. The latter approach is illustrated by representing words as collections of syntactic co-occurrence features obtained by parsing a corpus. For example, both the noun duty and the noun sanction would have feature sets containing the feature subj-of(include), but only sanction would have the feature adj-mod(economic), since “economic sanctions” appears in the corpus but “economic duties” does not. Because these features include both labeled syntactic relationships and the lexical items filling argument roles, the underlying representational model can be thought of as capturing both syntactic and semantic components of verb meaning.

Lin computes the quantity of shared information as the information in the intersection of the distributional feature sets for the two items being compared. This yields the following measure:

$$\mathrm{wsim}_{\mathrm{distrib}}(w_1, w_2) = \frac{2 \times I(F(w_1) \cap F(w_2))}{I(F(w_1)) + I(F(w_2))} \qquad (5)$$

where $F(w_i)$ is the feature set associated with word $w_i$, and where $I(S)$, the quantity of information in a feature set $S$, is computed as $I(S) = -\sum_{f \in S} \log P(f)$.4 In the experiments described here, we use similarity values obtained for verb pairs using Lin’s implementation of his model, with his feature sets and probabilities obtained via analysis of a 22-million-word corpus of newswire text.

Semantic  Structure  Model 

Our third method for assessing the semantic similarity of verbs relies on elaborated representations of verb semantics according to the theory of lexical conceptual structure, or LCS (Dorr, 1993; Jackendoff, 1983). LCS representations make an explicit distinction between semantic structure, which characterizes the grammatically relevant facets of verb meaning, and semantic content, which characterizes idiosyncratic information associated with the verb but not reflected in its syntactic behavior. This difference between semantic structure and semantic content plays an important role in current research on lexical representation (e.g. Grimshaw, 1993; Pinker, 1989; Rappaport, Laughren, & Levin, 1993). We take advantage of this distinction here to derive a measure that focuses exclusively on similarity of semantic structure as disentangled from semantic content.

4 Note the assumption that features are independent, permitting the summation of log probabilities.

To illustrate with a simple example, within an LCS representational system roll and slide might both have semantic structure indicating a change of location, e.g.,

(go_loc x
  (to_loc x (at_loc x y))
  (from_loc x (at_loc x z))
  (manner (M))),

and differ only in the value (M), an element of semantic content within the semantic structure, indicating the manner of motion (either (sliding) or (rolling)). Such regularities in semantic structure are argued to provide an explanation for systematic relationships between meaning and syntactic realization (Levin & Rappaport Hovav, 1998).

If those regularities are a part of verb lexical representations, then they also plausibly influence ratings of verb similarity, and the question is how to assess similarity between two such structured representations. Lin’s work provides one plausible answer: decomposing complex representations into (pseudo-)independent feature sets and then comparing feature sets.5 Our method of decomposition was particularly simple, recursively creating an independent feature from each primitive component of the representation and the “head” of its subordinates. So, for example, the feature set representation of roll would contain six features:

[go_loc to_loc from_loc manner]
[to_loc x at_loc]
[at_loc x y]
[from_loc x at_loc]
[at_loc x z]
[manner (rolling)].

The features of slide would be identical but for the last feature, which would instead be [manner (sliding)], and the nearly complete overlap between the feature sets for the two verbs captures the fact that the semantic distinction between this particular pair of verbs rests entirely on semantic content and not semantic structure.

Since we had available to us a large lexicon of LCS representations for verbs in English (Dorr & Olsen, 1996, 1997), containing thousands of lexical entries, we estimated the probability of each feature by counting feature occurrences within the lexicon. We define the similarity of two LCS lexicon entries $e_1$ and $e_2$ using the shared information content of their feature sets:

$$\mathrm{sim}_{\mathrm{lcs}}(e_1, e_2) = I(F(e_1) \cap F(e_2)) \qquad (6)$$

using $I(S)$ as in (5), and we compute $\mathrm{wsim}_{\mathrm{lcs}}(w_1, w_2)$ as the maximum value of $\mathrm{sim}_{\mathrm{lcs}}$ taken over the cross product of all the words’ lexical entries.6

5 We are grateful to Dekang Lin for suggesting this approach to us.

It is worth emphasizing that this similarity measure considers only semantic structure, not semantic content, and therefore only syntactically relevant components of meaning enter into the computation. For example, in the comparison of LCS entries for slide and roll, $F(e_1) \cap F(e_2)$ will never contain either [manner (rolling)] or [manner (sliding)], and therefore any potential similarities or differences between the content elements (the physical aspects of sliding motion versus rolling motion based on real-world knowledge) are excluded from the model.
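
The LCS-based measure can be sketched in Python using the roll and slide feature sets listed above. Several details are assumptions made only for this illustration: the three-entry toy lexicon, the hypothetical entry for stay and its features, the choice to estimate a feature's probability as the fraction of lexicon entries containing it, and all function names.

```python
import math
from collections import Counter

# Structural features shared by roll and slide, as listed in the text.
STRUCTURE = {
    "[go_loc to_loc from_loc manner]",
    "[to_loc x at_loc]",
    "[at_loc x y]",
    "[from_loc x at_loc]",
    "[at_loc x z]",
}

# Toy lexicon: word -> list of entries, each entry a feature set.
LEXICON = {
    "roll": [STRUCTURE | {"[manner (rolling)]"}],
    "slide": [STRUCTURE | {"[manner (sliding)]"}],
    "stay": [{"[be_loc x at_loc]", "[at_loc x y]"}],  # hypothetical entry
}

ENTRIES = [entry for entries in LEXICON.values() for entry in entries]
COUNTS = Counter(f for entry in ENTRIES for f in entry)


def info(features):
    """I(S), with P(f) assumed to be the fraction of entries containing f."""
    return -sum(math.log(COUNTS[f] / len(ENTRIES)) for f in features)


def sim_lcs(e1, e2):
    """Shared information content of two lexicon entries."""
    return info(e1 & e2)


def wsim_lcs(w1, w2):
    """Maximum entry similarity over the cross product of the words' entries."""
    return max(sim_lcs(e1, e2) for e1 in LEXICON[w1] for e2 in LEXICON[w2])
```

Because the two manner features differ, neither appears in the intersection for roll and slide, so the score depends only on the shared structural features.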

Experiment 

In order to assess alternative computational models of similarity, we collected human ratings of similarity for pairs of verbs, following a design after that of Miller and Charles (1991). Considering the additional complexities in the verb lexicon, however, the selection of materials required considerable care: we paid close attention to syntactic subcategorization, thematic grids, and aspectual class information, as described below, in order to limit the possible dimensions across which the two verbs in a pair could differ and to focus on semantic similarity. We also designed two versions of the task, with and without presentation of verbs in context, in order to investigate the extent to which contextual narrowing of verbs’ senses affects ratings of similarity.

Participants. Participants were 10 volunteers, all native speakers of English, ranging in age from 24 to 53, without significant background in psychology or linguistics. All participated by e-mail.

Materials. In constructing the set of verb pairs for similarity ratings, we began with the set of verbs in a large lexicon of LCS entries, containing entries for 4900 verbs. Verb entries in the lexicon contain information about both aspectual features (dynamicity, durativity, telicity; Olsen, 1997) and thematic grid (identifying whether or not a verb takes an agent, theme, goal, etc.); for example, the verb broil requires both an agent and a theme, and is marked as both durative and telic but not dynamic. For subcategorization information, we referred to the Collins Cobuild dictionary (Sinclair, 1995), using the subcategorization frame for the first listed verb sense.

To construct verb pairs, we began by eliminating all verbs whose thematic grid did not require a theme, in order to limit the range of variation in thematic grids.7 We then grouped the full set of verbs into eight lists corresponding to the eight possible combinations of the three aspectual features, and restricted our attention to the four most numerous lists.8 Within each of those four lists, we created 12 pairs of verbs subject to the constraint that the verbs’ associated subcategorization frames had to match, so as to avoid effects of purely syntactic similarity. Items were selected to span the range from low- to high-similarity verb pairs.

6 Although our probability estimate counts features within a set of types (entries in a large lexicon) rather than tokens (verb instances in a large corpus), inspection of the estimated probabilities suggests that frequent features are suitably discounted, having low information content, and rare features are highly informative. Corpus-based estimates are a matter for future work.

7 All verbs require an agent, so the remaining variation is in the presence or absence of oblique roles such as GOAL.

In  summary,  a  set  of  48  verb  pairs  was  constructed 
so  that  (i)  both  verbs  in  every  pair  require  a  theme, 
(ii)  both  verbs  have  the  same  subcategorization  frame, 
and  (iii)  both  verbs  come  from  the  same  aspectual  class. 
Verbs  on  the  list  were  all  given  in  the  past  tense.  In 
order  to  avoid  ordering  effects,  half  the  subjects  in  each 
condition  saw  items  in  a  random  order,  and  the  other 
half  saw  the  items  in  the  reverse  order. 
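
The grouping and ordering steps described above can be sketched as follows. This is an illustration only: apart from broil, whose aspectual marking is given in the text, the entries in `ENTRIES` are placeholders with invented features, and the function names and the seeded shuffle are assumptions.

```python
import random
from itertools import product

ASPECT_FEATURES = ("dynamic", "durative", "telic")

# (verb, aspectual features) entries; a verb may have more than one entry.
ENTRIES = [
    ("broil", {"durative", "telic"}),   # marking given in the text
    ("verb_a", {"durative"}),           # placeholder entries from here on
    ("verb_b", {"durative"}),
    ("verb_c", {"durative", "dynamic"}),
    ("verb_c", {"dynamic", "telic"}),
    ("verb_d", {"durative", "dynamic", "telic"}),
]


def aspect_classes():
    """All eight combinations of the three binary aspectual features."""
    return [
        frozenset(f for f, on in zip(ASPECT_FEATURES, bits) if on)
        for bits in product((False, True), repeat=len(ASPECT_FEATURES))
    ]


def group_by_class(entries):
    """Assign each entry to the list for its combination of features."""
    lists = {cls: [] for cls in aspect_classes()}
    for verb, feats in entries:
        lists[frozenset(feats)].append(verb)
    return lists


def most_numerous(lists, k=4):
    """The k aspectual classes with the most verbs."""
    return sorted(lists, key=lambda cls: len(lists[cls]), reverse=True)[:k]


def presentation_orders(pairs, seed=0):
    """One random order of the items and its exact reverse."""
    order = list(pairs)
    random.Random(seed).shuffle(order)
    return order, order[::-1]
```

Half of the subjects in a condition would then receive the first order and the other half the second.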

To assess the effects that contextual narrowing of verb senses might have on similarity ratings, the materials as just described were duplicated in order to create No Context and Context conditions. The conditions were identical except that in the Context condition, each item was accompanied by an example sentence for each verb illustrating the verb’s intended sense. Each example sentence came from the corresponding verb entry in the Collins Cobuild dictionary. For example, the example sentence for loosen was “He loosened his seat belt.”


Procedure. The 10 subjects were split evenly into Context and No Context groups. Subjects in the No Context group were given the set of 48 verb pairs, without example sentences, and asked to compare their meanings on a scale of 0-5, where 0 means that the verbs are not similar at all and 5 indicates maximum similarity. Subjects were explicitly asked to ignore similarities in the sound of the verb and similarities in the number and type of letters that make up the verb. Subjects were also asked explicitly to rate similarity rather than relatedness, with the instructions giving an example of the distinction. (For example, pay and eat are related in that they are things we do in restaurants, but they are not particularly similar.) Since some verbs in the set have low frequency, a “don’t know” box was included for subjects to mark if they were unsure of the meaning of either verb. There was no time limit on the task, which tended to take approximately 20 minutes.

Subjects in the Context group were given exactly the same task, but using the Context materials, i.e. with each verb accompanied by an example sentence illustrating the intended sense. As in the previous condition, two orders of presentation were used within this condition to avoid ordering effects.

Each  computational  similarity  measure  took  the  set 
of  verb  pairs  as  input,  without  context,  and  computed  a 
similarity  score  for  each. 


8 These were {durative}, {durative, dynamic}, {dynamic, telic}, {durative, dynamic, telic}. Verbs could and did appear on multiple lists.


Table 1: Comparing sets of ratings

wsim          Context       No Context
edge          .720          .675
info1         .779          .658
info2         .768          .668
distrib       .458          .488
lcs           [illegible]   [illegible]
Combined      .872          .785
Inter-rater   .798          .764

Results and Discussion. In order to judge the degree to which sets of similarity ratings are predictive of each other, we use a similarity coefficient computed as Pearson’s r. Table 1 provides a summary showing r for each computational model as compared to the mean of the human subject ratings in the Context and No Context conditions.9

The Combined row of the table shows the value of multiple R when the five computational measures are compared with human ratings using a multiple regression (see below), and the Inter-rater row of the table shows human average inter-rater agreement, measured by r, using leave-one-out resampling (Weiss & Kulikowski, 1991).
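
As a minimal sketch of these two statistics, the following Python code computes Pearson's r and an average leave-one-out agreement. The reading of leave-one-out resampling used here, correlating each rater with the mean of the remaining raters and averaging the results, is an assumption of this sketch, as are the function names.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def leave_one_out_agreement(ratings):
    """Average r between each rater and the mean of all other raters.

    ratings[i][j] is rater i's rating of item j.
    """
    rs = []
    for i, held_out in enumerate(ratings):
        others = [r for k, r in enumerate(ratings) if k != i]
        mean_others = [sum(col) / len(col) for col in zip(*others)]
        rs.append(pearson_r(held_out, mean_others))
    return sum(rs) / len(rs)
```

A model's fit to the human data would be obtained by calling `pearson_r` on the model's scores and the per-item mean of the human ratings.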

Examining these figures, we first consider each computational model separately. It is unsurprising that the similarity measure based on LCS representations fares worst, given the design of the experiment: the verb pairs were selected so as to eliminate differences of subcategorization frame, aspectual class, and thematic grid, ruling out a priori pairs that differ interestingly with respect to semantic structure. The distributional measure based on syntactic co-occurrence features may be a victim of its dependence on a particular corpus, and of data sparseness; for example, glaring divergences with human ratings include some verb pairs containing some lower-frequency words, such as embellish/decorate and dissolve/dissipate. Turning to the taxonomic methods, the information-based approaches appear superior to edge counting in the Context condition, consistent with previous work on noun similarity, though in the No Context condition there are no clear differences. We suspect a difference will emerge with a larger set of items, but this remains to be seen. Our inspection of by-item ratings of the information measures suggests strongly that the differences between the unnormalized and normalized information-based measures are small in comparison to the role played by the structure of the WordNet verb taxonomy.

9 From the full set of items, 10 verb pairs were excluded because some participant did not know the meaning of one or the other verb. Moreover, in preparation of the final version of this paper, we discovered that 11 verb pairs inadvertently had been included despite failing to strictly match the criteria described in the Materials section or having other minor errors of presentation, and these are now excluded as well. Although this is a large number of excluded items, we consider them quite unlikely to have affected participants’ judgments since the excluded pairs were distributed almost perfectly evenly over the four verb lists and varied across degrees of similarity, and since the pattern of results was unaffected. We report all quantitative results in the paper based on only the 27 non-excluded verb pairs.

Comparison of human raters yields several interesting observations. First, a comparison of the Context and No Context mean ratings by human participants yields r = .89, which provides some reassurance that subjects in the No Context condition are generally interpreting the verbs in the same sense as are subjects in the Context condition, where, recall, the context sentence encouraged interpretation according to the first listed verb sense in the Collins Cobuild dictionary. Second, however, average inter-rater agreement in the two conditions (.79 and .76) is much lower than that obtained in a noun ratings experiment using the same method, where leave-one-out resampling yielded an estimate of r = .90 (Resnik, 1999). This may reflect the small sample size in each group (N = 5), but we suspect that in actuality it is evidence that word similarity is harder for subjects to quantify for verbs than for nouns. Third, we find that subjects in the No Context condition have a very strong tendency to assign higher similarity ratings to the same pair as compared to subjects in the Context condition, as determined using a paired t-test (N = 27, t(26) = 4.49, p < .0002).
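
For reference, the paired t statistic can be computed as in the sketch below, where each verb pair contributes one difference between its mean rating in the two conditions. The function name is an assumption of this sketch, and obtaining a p value would additionally require the t distribution with the returned degrees of freedom.

```python
import math


def paired_t(xs, ys):
    """Paired-samples t statistic and degrees of freedom for xs versus ys."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the per-item differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, n - 1
```

With 27 items, as in the analysis reported above, the statistic has 26 degrees of freedom.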

This last observation is consistent with the idea that subjects in the No Context condition are accommodating verb comparisons (allowing for more flexible interpretations of verb meaning) in a way not available to subjects in the Context condition because their interpretations are constrained by the context sentence. For example, the verb pair compose/manufacture has a mean rating of 2.8 in the Context condition, and the context sentences are He sees the whole, not the various lines that compose it and Many factories were manufacturing desk calculators. In the No Context condition, the mean rating for this pair is 4.0, likely indicating that in the process of comparison, subjects focused on available semantic elements of compose’s meaning that are closest to manufacture (e.g., the notion of composing as creating, She composed satirical poems for the New Statesman).

As a preliminary step toward combining models, we performed a multiple regression predicting human ratings using the ratings of the five computational models as independent variables, with the results shown in Table 1 as Combined. Although we have not extensively analyzed these data, regressions using all $2^5 - 1 = 31$ combinations of models show that the highest multiple R is obtained when all five models are combined, that the two different information-based measures are making essentially the same contribution to the combined model (consistent with our observation that WordNet structure plays the dominant role, rather than details of the measure), and that the LCS measure contributes little for this set of items. Taking these observations into account, the improvement in predictive power when combining models comes from distributional and information-based models being sensitive to at least some different information.
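
The enumeration of model combinations can be sketched as follows. The sketch fits an ordinary least-squares regression with an intercept through the normal equations and reports multiple R as the correlation between observed and fitted values; the solver, the function names, and the expected input format (one list of scores per model) are assumptions of this illustration, and no data from the experiment are reproduced.

```python
import math
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def multiple_r(predictors, y):
    """Correlation between y and its least-squares fit on the predictors."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]  # intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    my, mf = sum(y) / len(y), sum(fitted) / len(fitted)
    cov = sum((a - my) * (f - mf) for a, f in zip(y, fitted))
    var_y = sum((a - my) ** 2 for a in y)
    var_f = sum((f - mf) ** 2 for f in fitted)
    return cov / math.sqrt(var_y * var_f)


def best_subset(scores, human):
    """Fit every non-empty subset of models; return (count, (subset, R))."""
    names = sorted(scores)
    subsets = [c for k in range(1, len(names) + 1) for c in combinations(names, k)]
    results = [(s, multiple_r([scores[n] for n in s], human)) for s in subsets]
    return len(subsets), max(results, key=lambda sr: sr[1])
```

With five models the function fits 31 regressions. Note that in-sample multiple R cannot decrease when a predictor is added to a least-squares model, so comparisons among subsets are most informative when they look at how much each added model raises R.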

General  Discussion 

The experimental results reflect the fact that similarity measures model different aspects of verb representation and use. Taxonomic similarity measures place little emphasis on verbs’ argument structure, emphasizing relationships of semantic content; for example, drag and tug appear quite close in the taxonomy (under displace) although they differ significantly in semantic structure (e.g. in “the tailpipe dragged” and “the donkey tugged” the syntactic subjects have different thematic roles). Conversely, semantic structure is emphasized in the measure based on LCS representations to the exclusion of real-world knowledge, such as the similarity of the physical motions of dragging and tugging. Distributional similarity based on syntactic co-occurrence features is a combination, capturing elements of semantic structure by means of the syntactic relationships (one- versus two-participant relationships), and also indirectly capturing elements of semantic content by means of the lexical items co-occurring in those syntactic positions (tug being weighted more heavily against inanimate subjects than drag, for example). Based on the performance of the models, and improved predictive power of the multiple regression, we interpret our results as evidence that human ratings of similarity are sensitive to both paradigmatic and syntagmatic facets of verb representation, and we believe the computational models are capturing relevant aspects of verb representation in order to make predictions about similarity judgments.

On a somewhat speculative note, it is interesting to briefly examine cases where the computational models fail to capture similarities identified by the human raters. Consider, for example, items unfold/divorce, chill/toughen, initiate/enter. Based on the WordNet taxonomy, the verbs in these pairs have no common subsumer, so the shared information content is zero; nor do the distributional or LCS measures predict that they are at all similar. The human mean ratings are low (averaging 1.6, 1.4, and 3.2, respectively, in the No Context condition), but why are they not zero, and why are they in fact higher than the ratings for some other pairs, such as open/inflate (0.6), where one could also identify reasons for believing the meanings have something in common? It would appear that in these cases subjects are finding similarities of meaning according to dimensions that we have not yet formalized. The apparent sense extensions verge on the metaphorical: one can describe divorce as the unfolding of a marriage, observe a person chill and toughen in response to an insult, enter a group by being initiated into it. Capturing those dimensions of similarity in our models will require a better understanding than we have at present of how word meanings are represented and organized.

Even for the time being, however, the work described in this paper offers a method and a testbed for investigating lexical issues that can go well beyond the present experiments. We chose here to tightly control aspect and syntactic subcategorization while allowing our test items to differ on thematic grids and vary widely with respect to semantic content. Having validated the approach (performance being consistent with what one would predict of the alternative models given the design of the task), the initial work opens the door to other configurations, controlling variation among subcategorization frames, aspectual features, thematic grids, and semantic content in other combinations. What is crucial is that implemented models of similarity, drawing on such theoretical constructs, yield testable predictions that can be verified through careful experimentation.
Appendix: Verb Pairs

bathe         kneel
loosen        open
chill         toughen
neutralize    energize
compose       manufacture
obsess        disillusion
compress      unionize
open          inflate
crinkle       boggle
percolate     unionize
displease     disillusion
plunge        bathe
dissolve      dissipate
prick         compose
embellish     decorate
swagger       waddle
festoon       decorate
unfold        divorce
fill          inject
wash          sap
hack          unfold
weave         enrich
initiate      enter
whisk         deflate
lean          kneel
wiggle        rotate
loosen        inflate